The objective of this project is to analyse crime in the city of San Francisco, a city known for its cultural richness as well as its economic strength. Much of the current media landscape is filled with stories of violence in inner cities, so we will start by looking at crime trends over time, with data covering 2003 to 2020. Our analysis will focus on the interactions of crime with three phenomena: socio-demographic characteristics, transportation and the COVID-19 crisis. Our intuition is that interconnectedness and density play a significant role in all three, and we will see where the empirical data leads us.

Having collected data on the socio-economic characteristics of the city’s 41 neighborhoods (age, income, educational attainment, etc.), we will attempt to predict the neighborhoods with the highest crime rates based on these variables. We will also look at possible correlations between public transportation density and crime. Finally, we will examine whether the current pandemic and the associated lockdowns had any effect on crime rates. One of us took a criminology course during his bachelor’s degree and was deeply interested in finding explanations for crime. Understanding why crime happens could be useful for other studies, and there is still a lot to be discovered.
First, we will load two data sets that record the incident reports filed with the police.
Source of the data set: [https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783]
This data set covers the period from 1 January 2018 to the day in 2020 when we downloaded the data, and contains about 408K observations.
We will focus on these variables: Incident Date, Incident Time, Incident Day of Week, Incident Category, Resolution, Analysis Neighborhood, point.
| Variable | Description |
|---|---|
| Incident Datetime | The date and time when the incident occurred |
| Incident Date | The date the incident occurred |
| Incident Time | The time the incident occurred |
| Incident Year | The year the incident occurred, provided as a convenience for filtering |
| Incident Day of Week | The day of week the incident occurred |
| Report Datetime | Distinct from Incident Datetime, Report Datetime is when the report was filed. |
| Incident Category | A category mapped on to the Incident Code used in statistics and reporting. |
| Incident Subcategory | A subcategory mapped to the Incident Code that is used for statistics and reporting. |
| Incident Description | The description of the incident that corresponds with the Incident Code. |
| Resolution | The resolution of the incident at the time of the report. |
| Analysis Neighborhood | This field is used to identify the neighborhood where each incident occurs. Neighborhoods and boundaries are defined by the Department of Public Health and the Mayor’s Office of Housing and Community Development. Please reference the link below for additional info: [https://data.sfgov.org/d/p5b7-5n3h]. |
| Latitude | The latitude coordinate |
| Longitude | The longitude coordinate |
| point | The point geometry used for mapping features in the open data portal platform. Latitude and Longitude are provided separately as well as a convenience. |
To join the two data sets, which cover different periods, and to focus on the variables relevant to our research questions, we removed a few columns.
The remaining variables let us follow the evolution of crime over time and compare crime across neighborhoods, so that we can test whether differences between neighborhoods are explained by socio-economic variables.
Source of the data set: [https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry]
This data set covers the period from 1 January 2003 to 15 May 2018 and contains about 2.16M observations.
As with the 2018-to-present data set, we will focus on these variables: Incident Category, DayOfWeek, Date, Time, Resolution, Y, X, location, Analysis Neighborhoods 2 2.
| Variable | Description |
|---|---|
| Incident Category | A category mapped on to the Incident Code used in statistics and reporting. |
| Descript | The description of the incident that corresponds with the Incident Code. |
| DayOfWeek | The day of week the incident occurred |
| Date | The date the incident occurred |
| Time | The time the incident occurred |
| Resolution | The resolution of the incident at the time of the report. |
| Y | The latitude coordinate |
| X | The longitude coordinate |
| location | The point geometry used for mapping features in the open data portal platform. Latitude and Longitude are provided separately as well as a convenience. |
| Analysis Neighborhoods 2 2 | This field is used to identify the neighborhood where each incident occurs. Neighborhoods and boundaries are defined by the Department of Public Health and the Mayor’s Office of Housing and Community Development. Please reference the link below for additional info: [https://data.sfgov.org/d/p5b7-5n3h]. |

For Analysis Neighborhoods 2 2, we had to map the numeric codes in this column to the neighborhood names and check that the mapping was correct.
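The harmonization described above can be sketched as follows. This is a minimal Python illustration, not the code used for the project: the code-to-name lookup is hypothetical, the sample rows are invented, and the cutoff is an assumption. The cutoff matters because the stated periods overlap (the historical export runs to 15 May 2018, the newer one starts on 1 January 2018), so stacking the two files without a cutoff could count some incidents twice.

```python
from datetime import date

# Historical (2003-2018) column names mapped to the 2018-to-present names.
HISTORICAL_TO_CURRENT = {
    "Date": "Incident Date",
    "Time": "Incident Time",
    "DayOfWeek": "Incident Day of Week",
    "Descript": "Incident Description",
    "Y": "Latitude",
    "X": "Longitude",
    "Analysis Neighborhoods 2 2": "Analysis Neighborhood",
}

# Hypothetical code-to-name lookup; the real codes must be checked against
# the Analysis Neighborhoods reference data.
NEIGHBORHOOD_NAMES = {1: "Example Neighborhood A", 2: "Example Neighborhood B"}


def harmonize(row):
    """Rename historical columns and replace the neighborhood code by its name."""
    out = {HISTORICAL_TO_CURRENT.get(key, key): value for key, value in row.items()}
    out["Analysis Neighborhood"] = NEIGHBORHOOD_NAMES.get(out.get("Analysis Neighborhood"))
    return out


def combine(historical_rows, current_rows, cutoff=date(2018, 1, 1)):
    """Stack both sources, keeping only historical rows dated before the cutoff."""
    kept = [harmonize(row) for row in historical_rows if row["Date"] < cutoff]
    return kept + list(current_rows)


# Invented sample rows, for illustration only.
historical = [
    {"Incident Category": "Assault", "Date": date(2017, 6, 2),
     "DayOfWeek": "Friday", "Analysis Neighborhoods 2 2": 1},
    {"Incident Category": "Assault", "Date": date(2018, 2, 1),
     "DayOfWeek": "Thursday", "Analysis Neighborhoods 2 2": 2},  # in the overlap, dropped
]
current = [
    {"Incident Category": "Assault", "Incident Date": date(2018, 2, 1),
     "Incident Day of Week": "Thursday",
     "Analysis Neighborhood": "Example Neighborhood B"},
]
combined = combine(historical, current)
```

Keeping the newer export for the overlapping months is only one possible choice; the opposite choice works as long as each incident is counted once.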
Then, we will load four data sets that record the evolution of some socio-economic variables for each neighborhood of San Francisco. These reports, built from census data, were only published as PDF files, so we transcribed them into our own CSV data set in order to use the information for the different periods.
Source of the data set: [https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2012-2016_ACS_Profile_Neighborhoods_Final.pdf]
#> Warning: Missing column names filled in: 'X39' [39]
Source of the data set: [https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2011-2015_ACS_Profile_Neighborhoods_Final.pdf]
Source of the data set: [https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2010-2014_ACS_Profile_Neighborhoods_v3AH.pdf]
Source of the data set: [https://sf-planning.org/sites/default/files/FileCenter/Documents/8501-SFProfilesByNeighborhoodForWeb.pdf]
Columns of the data sets:
| Variable | Description |
|---|---|
| NEIGHBOORHOD | Each neighborhood of San Francisco, segmented using the geospatial data at this link: [https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h] |
| Time Horizon | Period over which the data were collected |
| Total Population | Total population of the neighborhood |
| Households | Total number of households in the neighborhood |
| Family Households | Proportion of family households in the neighborhood |
| Non-Family Households | Proportion of non-family households in the neighborhood |
| Average Household Size | Average household size in the neighborhood |
| Asian | Proportion of Asian people in the neighborhood |
| Black/African American | Proportion of Black/African American people in the neighborhood |
| White | Proportion of White people in the neighborhood |
| Native American Indian | Proportion of Native American Indian people in the neighborhood |
| Native Hawaiian/Pacific Islander | Proportion of Native Hawaiian/Pacific Islander people in the neighborhood |
| Other/Two or More Races | Proportion of people of other or two or more races in the neighborhood |
| Latino (of Any Race) | Proportion of Latino people (of any race) in the neighborhood |
| 0-4 years | Proportion of people aged 0 to 4 in the neighborhood |
| 5-17 years | Proportion of people aged 5 to 17 in the neighborhood |
| 18-34 years | Proportion of people aged 18 to 34 in the neighborhood |
| 35-59 years | Proportion of people aged 35 to 59 in the neighborhood |
| 60 and older | Proportion of people aged 60 and older in the neighborhood |
| Median Age | Median age in the neighborhood |
| High School or Less | Proportion of people whose highest educational attainment is a high school degree or less |
| Some College/Associate Degree | Proportion of people whose highest educational attainment is some college or an associate degree |
| College Degree | Proportion of people whose highest educational attainment is a college degree |
| Graduate/Professional Degree | Proportion of people whose highest educational attainment is a graduate or professional degree |
| Foreign Born | Proportion of foreign-born people in the neighborhood |
| English Only | Proportion of people who speak only English |
| Spanish Only | Proportion of people who speak only Spanish |
| Asian/Pacific Islander | Proportion of people who speak only an Asian/Pacific Islander language |
| Other European Languages Only | Proportion of people who speak only another European language |
| Other Languages | Proportion of people who speak only another language |
| Units of Housing | Number of housing units in the neighborhood |
| Median Year Structure Build | Median year in which the housing structures were built |
| Median Rent | Median housing rent in the neighborhood |
| Median Home Value | Median home value in the neighborhood |
| Median Household Income | Median household income in the neighborhood |
| Percent in Poverty | Percentage of people in poverty in the neighborhood |
| Unemployment Rate | Unemployment rate (%) in the neighborhood |
| Population Density per Acre | Number of people per acre in the neighborhood |

With all these variables, we can describe each neighborhood in terms of population size and density, household composition, race, age distribution, education, languages spoken, and the economy of the neighborhood. This will be very helpful for answering some of our research questions.
Source of the data set: [https://catalog-next.data.gov/dataset/covid-19-cases-and-deaths-summarized-by-geography]
Columns of the data sets:

| Variable | Description |
|---|---|
| count | Number of cases |
| acs_population | Population |
| multipolygon | Geospatial information |

Source of the data set: [https://data.sfgov.org/COVID-19/COVID-19-Cases-Summarized-by-Date-Transmission-and/tvq9-ec9w]
Next, we will load a data set that lists the transit stops in the SFMTA system. We will use this data to analyze the transit situation by neighborhood.
Source of the data set: [https://catalog.data.gov/dataset/muni-stops].
| Variable | Description |
|---|---|
| STOPNAME | The names of all the transit stops |
| shape | The point geometry used for mapping features |
| Analysis Neighborhoods | The neighborhood of San Francisco in which the transit stop is located |

The other columns of this data set are not relevant for our research project. We will only use this data set to analyze the density of the public transport network and to see whether the neighborhoods with the most transit stops are also those with the most crime.
#> Reading layer `Analysis Neighborhoods' from data source `/Users/ROUGE/Desktop/STUDY HARD/Master/Data_Science/dsfba_project/data/Analysis Neighborhoods.geojson' using driver `GeoJSON'
#> Simple feature collection with 41 features and 1 field
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -123 ymin: 37.7 xmax: -122 ymax: 37.8
#> geographic CRS: WGS 84
#> Simple feature collection with 41 features and 1 field
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -123 ymin: 37.7 xmax: -122 ymax: 37.8
#> geographic CRS: WGS 84
#> First 10 features:
#> nhood geometry
#> 1 Bayview Hunters Point MULTIPOLYGON (((-122 37.8, ...
#> 2 Bernal Heights MULTIPOLYGON (((-122 37.7, ...
#> 3 Castro/Upper Market MULTIPOLYGON (((-122 37.8, ...
#> 4 Chinatown MULTIPOLYGON (((-122 37.8, ...
#> 5 Excelsior MULTIPOLYGON (((-122 37.7, ...
#> 6 Financial District/South Beach MULTIPOLYGON (((-122 37.8, ...
#> 7 Glen Park MULTIPOLYGON (((-122 37.7, ...
#> 8 Golden Gate Park MULTIPOLYGON (((-122 37.8, ...
#> 9 Haight Ashbury MULTIPOLYGON (((-122 37.8, ...
#> 10 Hayes Valley MULTIPOLYGON (((-122 37.8, ...
#> Coordinate Reference System:
#> User input: WGS 84
#> wkt:
#> GEOGCRS["WGS 84",
#> DATUM["World Geodetic System 1984",
#> ELLIPSOID["WGS 84",6378137,298.257223563,
#> LENGTHUNIT["metre",1]]],
#> PRIMEM["Greenwich",0,
#> ANGLEUNIT["degree",0.0174532925199433]],
#> CS[ellipsoidal,2],
#> AXIS["geodetic latitude (Lat)",north,
#> ORDER[1],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> AXIS["geodetic longitude (Lon)",east,
#> ORDER[2],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> ID["EPSG",4326]]
Source of the data set: [https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h]
This data set contains the multipolygons that delimit the neighborhoods corresponding to the Analysis Neighborhood variable. We will use this data set extensively for our geospatial analyses.
Here is our cleaned data set, which takes into account all crimes from 2003 to 2020. We show the first 6 observations. We will use it to look at crime trends.
| Incident Date | Incident Time | Incident Day of Week | Incident Category | Incident Description | Resolution | lon | lat | Analysis Neighborhood |
|---|---|---|---|---|---|---|---|---|
| 2020-08-15 | 12:43:00 | Saturday | Assault | Battery | OPEN | 37.71603881888 | -122.4402551358 | Excelsior |
| 2018-01-18 | 19:00:00 | Thursday | Lost Property | Lost Property | OPEN | NA | NA | NA |
| 2020-08-16 | 03:13:00 | Sunday | Assault | Firearm, Discharging in Grossly Negligent Manner | OPEN | 37.75482657771 | -122.3977287339 | Potrero Hill |
| 2020-08-16 | 03:38:00 | Sunday | Malicious Mischief | Malicious Mischief, Breaking Windows | OPEN | 37.76653957530 | -122.4220438145 | Mission |
| 2020-08-15 | 09:40:00 | Saturday | Larceny/Theft | Theft, From Locked Vehicle, >$950 | OPEN | NA | NA | NA |
| 2020-08-16 | 13:40:00 | Sunday | Non-Criminal | Mental Health Detention | OPEN | 37.78404443716 | -122.4037117546 | Financial District |
As you can see here, there are a few NA values for the latitude and longitude variables. Note that the lon and lat column labels in this table are swapped: the values near 37.7 are latitudes and those near -122.4 are longitudes. For the purpose of creating a time series graph, the NA values are not a problem, and thus we can keep them. However, for geospatial visualization, we are going to create a table where these NA values are removed.
Here is our cleaned data set that counts the number of public transportation stops in San Francisco. We replaced the numeric code of each neighborhood with the corresponding name. We will use this data set to see if there is a correlation between crime and public transportation density.
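As a minimal Python sketch of this split, the full table can be kept for the time series while a filtered copy is used for maps. The rows below are abbreviated from the sample above, with the coordinates rounded and stored under explicit `latitude` and `longitude` keys (these key names are an assumption for the illustration).

```python
# Abbreviated sample rows; None plays the role of NA.
incidents = [
    {"Incident Category": "Assault", "latitude": 37.716, "longitude": -122.440},
    {"Incident Category": "Lost Property", "latitude": None, "longitude": None},
    {"Incident Category": "Malicious Mischief", "latitude": 37.767, "longitude": -122.422},
]


def has_coordinates(row):
    """Return True when both coordinates are present."""
    return row["latitude"] is not None and row["longitude"] is not None


# All rows stay available for the time series; only located rows are mapped.
geo_incidents = [row for row in incidents if has_coordinates(row)]
```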
| Analysis Neighborhoods | number of stops |
|---|---|
| Sunset/Parkside | 263 |
| Bayview Hunters Point | 256 |
| West of Twin Peaks | 236 |
| Financial District | 168 |
| Mission | 149 |
| Castro/UpperMarket | 134 |
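A table of this kind can be obtained by counting stops per neighborhood. The sketch below uses Python's `collections.Counter` on a few invented stop records; the stop names are hypothetical, and only the `STOPNAME` and `Analysis Neighborhoods` columns are taken from the data set description.

```python
from collections import Counter

# Invented stop records, for illustration only.
stops = [
    {"STOPNAME": "Hypothetical Stop 1", "Analysis Neighborhoods": "Mission"},
    {"STOPNAME": "Hypothetical Stop 2", "Analysis Neighborhoods": "Mission"},
    {"STOPNAME": "Hypothetical Stop 3", "Analysis Neighborhoods": "Excelsior"},
]

# Count the stops in each neighborhood and sort from most to fewest.
stops_per_neighborhood = Counter(stop["Analysis Neighborhoods"] for stop in stops)
ranking = stops_per_neighborhood.most_common()
```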
This data set highlights the number of crimes by neighborhood for the years 2014 to 2016. We have combined the crime information with our socio-economic data. This dataset will be useful to build a linear regression as well as for the geospatial analysis.
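To illustrate how such a combined table could feed a linear regression, here is a minimal Python sketch of a one-predictor least-squares fit. The neighborhoods and all the numbers are invented, and the choice of `Percent in Poverty` as the single predictor is only an assumption for the example; the actual model may use several variables.

```python
# Invented rows: one per neighborhood, crime counts joined with census figures.
rows = [
    {"nhood": "Example A", "crimes": 500, "population": 10000, "poverty": 5.0},
    {"nhood": "Example B", "crimes": 2000, "population": 20000, "poverty": 10.0},
    {"nhood": "Example C", "crimes": 6000, "population": 40000, "poverty": 15.0},
]


def crime_rate_per_1000(row):
    """Crimes per 1,000 residents, so neighborhoods of different sizes compare fairly."""
    return 1000 * row["crimes"] / row["population"]


def simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


poverty = [row["poverty"] for row in rows]
rates = [crime_rate_per_1000(row) for row in rows]
intercept, slope = simple_linear_regression(poverty, rates)
```

Working with a rate rather than a raw count is a design choice of this sketch: it avoids attributing high crime counts to neighborhoods that are simply more populated.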
This data set gives the number of COVID cases by neighborhood. It will be useful to answer our research question on COVID-19.
This data set gives the daily number of COVID cases in San Francisco. It will also be useful to answer our research question on COVID-19.
| date | Case Count |
|---|---|
| 2020-03-10 | 6 |
| 2020-03-11 | 9 |
| 2020-03-12 | 6 |
| 2020-03-13 | 16 |
| 2020-03-14 | 10 |
| 2020-03-15 | 11 |
Now that our data has been cleaned up, we can move on to the analytical parts of our project.
For the first part of the exploratory data analysis, we are going to look at the evolution of crime over time. Can we see patterns that recur at certain times of the day, or at certain times of the year?
| Hour | Number of Crimes |
|---|---|
| 18 | 159418 |
| 17 | 153601 |
| 12 | 152153 |
| 19 | 143477 |
| 16 | 142237 |
| 15 | 136090 |
This table shows the number of crimes recorded in each one-hour interval, for the six busiest hours. Recorded crime peaks between 3pm and 8pm, at noon and at midnight. The three busiest hours are 6pm, 5pm and noon, which correspond to times of the day when workers are on a break or have finished working.
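The hourly counts can be computed by extracting the hour from each incident time and tallying. The Python sketch below applies this to the six incident times of the sample rows shown earlier, so the result only illustrates the mechanics and says nothing about the full data set.

```python
from collections import Counter
from datetime import time

# Incident times taken from the six sample rows of the cleaned data set.
incident_times = ["12:43:00", "19:00:00", "03:13:00", "03:38:00", "09:40:00", "13:40:00"]


def hour_of(timestamp):
    """Return the hour (0-23) of an HH:MM:SS string."""
    return time.fromisoformat(timestamp).hour


# Tally incidents by hour and list the hours from busiest to quietest.
crimes_per_hour = Counter(hour_of(t) for t in incident_times)
busiest_hours = crimes_per_hour.most_common()
```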
We can also show this same data with a barplot.
We can see that crime starts increasing from 8am. There is a first peak at noon, and then another one between 3pm and 8pm. Interestingly, there is less crime at night, although one might expect more because of darkness and nightlife activities.
From Monday to Thursday, we see the same trend: a gradual increase until 6pm, with a sudden peak at noon, and a gradual decrease after 6pm. On Fridays, Saturdays and Sundays, on the other hand, the increase lasts until later in the evening, presumably because of the more active nightlife.
We are now going to look at the distribution of crimes according to the days of the week. One might think that crimes increase on Fridays and Saturdays because that’s when people go out the most. But is this really the case?
Fridays record more crime than any other day of the week. However, the differences between days are small.
We will now look at the distribution of the number of crimes per month, to see if there was an increase or decrease in some months. As we will refine our analysis later on for 2016, we will analyze the variation in the number of crimes per month for this year.
We don’t see any particular pattern. There is some variation but it seems to be mostly noise. One would have thought that in months with warm weather we might see an increase in crime because people are going out more and are more active. But the variation between the months is small and inconclusive.
If we look at the number of crimes by day and by month, we still don’t see any clear pattern, and the variation seems to be constant.
We will now look at the evolution of the number of crimes per month from 2003 to 2020, and draw a time series plot over this period.
Monthly counts fluctuate, with varying intensity, so we have added a smoothing line to these temporal data. The smoothed line shows a roughly sinusoidal pattern: a decrease around the 2008 crisis, followed by an increase until a maximum in 2016, then a decline. Note the sharp drop in 2020: we’ll come back to this later.
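The smoothing method is not specified here. As one simple stand-in, a centered moving average shows how a smoothed line damps month-to-month noise; the Python sketch below is an illustration with invented monthly counts, not the smoother used for the plot.

```python
def centered_moving_average(values, window):
    """Average each point with its neighbors; edges without a full window give None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            smoothed.append(None)  # not enough neighbors on one side
        else:
            chunk = values[i - half:i + half + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed


# Invented monthly counts, for illustration only.
monthly_counts = [100, 120, 90, 110, 130, 95, 105]
trend = centered_moving_average(monthly_counts, window=3)
```

A wider window gives a smoother line but loses more points at both ends of the series.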
We will now turn our attention to the categories of crime. What type of crime is San Francisco most affected by?
The trend is clear: theft is by far the most common crime in San Francisco. The Larceny/Theft bar of the barplot clearly dominates the others. If we also take into account the Motor Vehicle Theft column, which can be considered theft, we can conclude that property crimes make up the largest share of crimes in San Francisco.
This visualization does not show clear trends for all categories, but we can see that, for Larceny/Theft, there is a peak at noon and between 5 and 9 pm.
With these representations of crimes by day and by category, we see that crimes are most frequent on Fridays and Saturdays. For assaults, this changes a little: we see more of them on Saturdays and Sundays. All in all, crime peaks around the weekend.
We will now produce some geographic visualizations to better understand where crime is concentrated in San Francisco.
The neighbourhoods with the most crime are Tenderloin, South of Market and Mission, along with Financial District and Bayview Hunters Point.